ON THE GEOMETRY OF MAXIMUM ENTROPY PROBLEMS* 



MICHELE PAVON† AND AUGUSTO FERRANTE‡



Abstract. We show that a simple geometric result suffices to derive the form of the optimal 
solution in a large class of finite and infinite-dimensional maximum entropy problems concerning 
probability distributions, spectral densities and covariance matrices. These include Burg's spectral 
estimation method and Dempster's covariance completion, as well as various recent generalizations 
of the above. We then apply this orthogonality principle to the new problem of completing a block- 
circulant covariance matrix when an a priori estimate is available. 
1. Prelude: Four famous maximum entropy problems. In this section, we
briefly review four classical maximum entropy problems that have played an important
role in the history of various scientific areas. In each of them, entropy
is maximized under linear constraints. We shall later derive the form of the optimal
solution in three of these problems by the same geometric principle (Theorem 3.3 in
Section 3).

1.1. 1877: Boltzmann's loaded dice. In 1877, Boltzmann [8, p. 169] posed
the following question: Consider $N$ molecules that can only take the following $p + 1$
values of kinetic energy$^1$: $0, \epsilon, 2\epsilon, \ldots, p\epsilon$. Suppose $n_i$ molecules have kinetic energy
$i\epsilon$, $i = 0, 1, \ldots, p$. We then have a "macrostate", a "Zustandverteilung" in
Boltzmann's language$^2$, indexed by $(n_0, n_1, \ldots, n_p)$ corresponding to the multinomial
coefficient

$$\frac{N!}{n_0!\, n_1! \cdots n_p!}$$

"microstates", each having probability $(p+1)^{-N}$. Suppose that the sum of the kinetic
energy of all molecules is a given quantity $\lambda\epsilon = L$. Boltzmann proceeded to find the
macrostate which corresponds to the largest number of microstates, namely the one that has
highest probability, among those having total kinetic energy $L$. This is, to the best of our
knowledge, the first maximum entropy problem in history.

Boltzmann's problem was popularized in the following form [65, 27]. Suppose $N$
dice are rolled and we are informed that the total number of spots is $N \cdot 4.5$. We are
asked: What proportion of the dice are showing face $i$, $i = 1, 2, \ldots, 6$? The number of
1 "lebendige Kraft", the classical vis viva originating with Gottfried Leibniz, which was actually
twice the kinetic energy.

2 The expression "Komplexion" in [8] refers instead to a microstate and not to a macrostate as
stated in \E9\ Section 4].
different ways that $N$ dice can fall so that $n_i$ dice show face $i$ is given by

(1.1)    $\dfrac{N!}{n_1!\, n_2! \cdots n_6!}, \qquad \sum_{i=1}^{6} n_i = N.$

Again, the "macrostate" $(n_1, n_2, \ldots, n_6)$ corresponds to $N!/(n_1!\, n_2! \cdots n_6!)$ "microstates", each
having probability $6^{-N}$. To find the most probable macrostate, we need to maximize
the multinomial coefficient (1.1) under the constraint

(1.2)    $\sum_{i=1}^{6} i \cdot n_i = N \cdot 4.5.$



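This procedure will yield the macrostate, among those satisfying (1.2), that can be
realized in the largest number of ways. Assuming that $N$ is large, we now use a crude version of
Stirling's approximation, $N! \approx (N/e)^N$. We get

$$\frac{N!}{n_1!\, n_2! \cdots n_6!} \approx \frac{(N/e)^{N}}{\prod_{i=1}^{6} (n_i/e)^{n_i}} = \prod_{i=1}^{6} \left(\frac{n_i}{N}\right)^{-n_i} = e^{N H(p)}, \qquad p_i = \frac{n_i}{N}, \quad i = 1, 2, \ldots, 6.$$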

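Thus, for $N$ large, maximizing (1.1) under (1.2) is almost equivalent to maximizing
the entropy

$$H(p) = -\sum_{i=1}^{6} p_i \ln(p_i)$$

under the constraint

(1.3)    $\sum_{i=1}^{6} i \cdot p_i = 4.5.$

The solution has the form

(1.4)    $p_i^* = \dfrac{e^{-\lambda i}}{\sum_{j=1}^{6} e^{-\lambda j}},$

where $\lambda$ must be such that

$$\frac{\sum_{i=1}^{6} i\, e^{-\lambda i}}{\sum_{j=1}^{6} e^{-\lambda j}} = 4.5.$$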



Hence, the most probable macrostate is $(Np_1^*, Np_2^*, \ldots, Np_6^*)$ and we expect to find
$n_i^* = Np_i^*$ dice showing face $i$. More is true: It can be shown [27, Chapter 13] that,
for $N$ large, with probability close to one, other distributions satisfying (1.3) are
close to $p^*$. This fact is sometimes referred to as the Entropy Concentration Theorem
[65]. More generally, when $F(p) := -\sum_k p_k \log(p_k)$, the maximizer of $F$ subject to a
linear constraint $Lp = c$ has the form of a Boltzmann-Gibbs distribution

(1.5)    $p_k^* = Z^{-1} \exp\left[-\langle \lambda, l_k \rangle\right],$

where $l_k$ is the $k$th column of the matrix $L$ and $Z$ a normalizing constant (partition
function). This can of course also be formulated in the continuous setting (with
integrals) and is also a basic result in statistics [3TJ [3H 123].
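As a numerical illustration of the dice problem, the following minimal sketch computes the multiplier and the resulting distribution for the mean constraint 4.5. It assumes the Gibbs form $p_i \propto e^{-\lambda i}$ (the sign convention for $\lambda$ is our choice here), and the function names `gibbs_distribution`, `mean_face` and `solve_multiplier` are hypothetical helpers introduced only for this example.

```python
import math

FACES = range(1, 7)  # faces of a single die


def gibbs_distribution(lam):
    """Return p_i proportional to exp(-lam * i) for i = 1, ..., 6."""
    weights = [math.exp(-lam * i) for i in FACES]
    z = sum(weights)  # normalizing constant (partition function)
    return [w / z for w in weights]


def mean_face(lam):
    """Expected number of spots under the Gibbs distribution."""
    return sum(i * p for i, p in zip(FACES, gibbs_distribution(lam)))


def solve_multiplier(target_mean, lo=-10.0, hi=10.0, iterations=200):
    """Bisection on lam; mean_face is strictly decreasing in lam."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mean_face(mid) > target_mean:
            lo = mid  # mean still too large: move towards larger lam
        else:
            hi = mid
    return 0.5 * (lo + hi)


lam_star = solve_multiplier(4.5)
p_star = gibbs_distribution(lam_star)
entropy = -sum(p * math.log(p) for p in p_star)

print("lambda =", lam_star)
print("p* =", [round(p, 4) for p in p_star])
print("mean =", mean_face(lam_star), " entropy =", entropy)
```

Since the prescribed mean 4.5 exceeds the unconstrained mean 3.5, the multiplier comes out negative under this sign convention and the probabilities increase with the face value.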
1.2. 1931: Schrödinger's Bridges. In 1931/32, before the very foundations
of probability were laid, Erwin Schrödinger studied the following abstract problem
[531 154"]. Consider the evolution of a cloud of $N$ independent Brownian particles. Here
$N$ is large, say of the order of Avogadro's number. This cloud of particles has been
observed having at some initial time $t_0$ an empirical distribution equal to $\rho_0(x)\,dx$.
At some later time $t_1$, an empirical distribution equal to $\rho_1(x)\,dx$ is observed which
considerably differs from what it should be according to the law of large numbers,
namely

$$\left[\int p(t_0, y, t_1, x)\, \rho_0(y)\, dy\right] dx,$$

where

$$p(s, y, t, x) = \left[2\pi (t - s)\right]^{-n/2} \exp\left[-\frac{|x - y|^2}{2(t - s)}\right], \qquad s < t,$$



is the transition density of the Wiener process. It is apparent that the particles have
been transported in an unlikely way. But of the many unlikely ways in which this
could have happened, which one is the most likely? Schrödinger showed that the
solution, namely the bridge from $\rho_0$ to $\rho_1$ over Brownian motion, has at each time a
density $q$ that factors as $q(x,t) = \varphi(x,t)\hat{\varphi}(x,t)$, where $\varphi$ and $\hat{\varphi}$ solve the system

(1.6)    $\varphi(t,x) = \int p(t, x, t_1, y)\, \varphi(t_1, y)\, dy, \qquad \varphi(t_0, x)\, \hat{\varphi}(t_0, x) = \rho_0(x),$
(1.7)    $\hat{\varphi}(t,x) = \int p(t_0, y, t, x)\, \hat{\varphi}(t_0, y)\, dy, \qquad \varphi(t_1, x)\, \hat{\varphi}(t_1, x) = \rho_1(x).$

It took more than fifty years before Föllmer, recovering Schrödinger's original
motivation, observed in [49] that this is a problem of large deviations$^3$ of the empirical
distribution on path space [43] connected, thanks to Sanov's theorem [82], to a
maximum entropy problem. Schrödinger's problem may be considerably generalized. Let
$\Omega := C([t_0, t_1], \mathbb{R}^n)$ denote the family of $n$-dimensional continuous functions, let $W_x$
denote Wiener measure on $\Omega$ starting at $x$, and let

$$W := \int W_x\, dx$$

be stationary Wiener measure. Let $\mathcal{D}$ be the family of distributions on $\Omega$ that are
equivalent to $W$. For $Q, P \in \mathcal{D}$, we define the relative entropy $D(P\|Q)$ of $P$ with
respect to $Q$ as

$$D(P\|Q) = E_P\left[\log \frac{dP}{dQ}\right],$$

where $dP/dQ$ is the Radon-Nikodym derivative of $P$ with respect to $Q$. Let $\mathcal{D}(\rho_0, \rho_1)$
be the set of distributions in $\mathcal{D}$ having the observed densities at times $t_0$ and $t_1$. If there is at
least one $P$ in $\mathcal{D}(\rho_0, \rho_1)$ such that $D(P\|Q) < \infty$, it may be shown that there exists



3 Large deviations theory has various applications in hypothesis testing, rate distortion theory, etc.;
see e.g. [27, Chapter 11], [38], [36, Chapters 2, 3, 7]. For large deviations of the empirical distribution
(level-2 large deviations) for diffusion processes see [49, 90] (see also [79] for a recent extension of
this theory to discrete-time classical and quantum evolutions).
a unique minimizer $P_c$ in $\mathcal{D}(\rho_0, \rho_1)$, called in the language of Csiszár the I-projection
of $Q$ onto $\mathcal{D}(\rho_0, \rho_1)$ [35, 30, 33]. It is the Schrödinger bridge from $\rho_0$ to $\rho_1$ over $Q$.
In [35], using a conditional version of Sanov's theorem established by Csiszár j3U], it
was shown that such I-projection $P_c$ provides the answer to Schrödinger's original
question: Namely, the asymptotic empirical distribution on path space, conditioned
on the initial and final empirical distributions being $\rho_0(x)\,dx$ and $\rho_1(y)\,dy$, respectively,
is indeed given by $P_c$.

1.3. 1967: Burg's spectral estimation method. Suppose the covariance
lags $c_k = E[y(k)y(0)]$, $k = 0, 1, \ldots, n-1$, of a stationary, zero-mean, Gaussian process
have been estimated from the data. How should one extend the covariance? In 1967,
while working on spectral estimation for geophysical data [11], Burg suggested the
following approach. Rather than setting the other covariance lags to zero, one should
set them to values such that they maximize the entropy rate (see Section 5 below) of
the process. The solution is an autoregressive process of the form

$$y(t) = \sum_{k=1}^{n-1} a_k\, y(t-k) + w(t),$$

where $w$ is a zero-mean, Gaussian white noise sequence with variance $\sigma^2$. The
parameters $a_1, \ldots, a_{n-1}, \sigma^2$ are such that the first $n$ covariance lags match the given
ones.

1.4. 1972: Dempster's covariance selection. In the seminal paper [37], a
general strategy for completing a partially specified covariance matrix was introduced.
Consider a zero-mean, multivariate Gaussian distribution with density

$$p(x) = (2\pi)^{-n/2} |\Sigma|^{-1/2} \exp\left\{-\tfrac{1}{2}\, x^\top \Sigma^{-1} x\right\}, \qquad x \in \mathbb{R}^n.$$

Suppose that the elements $\{\sigma_{ij};\ 1 \le i \le j \le n,\ (i,j) \in \mathcal{I}\}$, with $(i,i) \in \mathcal{I}$ for all
$i = 1, \ldots, n$, have been specified. How should $\Sigma$ be completed? Dempster resorts to a
form of the Principle of Parsimony of parametric model fitting: As the elements $\sigma^{ij}$ of
$\Sigma^{-1}$ appear as natural parameters of the model, one should set $\sigma^{ij}$ to zero for $1 \le i \le
j \le n$, $(i,j) \notin \mathcal{I}$. Notice that $\sigma^{ij} = 0$ has the probabilistic interpretation that the $i$-th
and $j$-th components of the Gaussian random vector are conditionally independent
given the other components [SB]. We say that a positive definite completion $\Sigma^\circ$
of $\Sigma$ is a Dempster Completion if $[(\Sigma^\circ)^{-1}]_{i,j} = 0$ for all $(i,j) \notin \mathcal{I}$. In particular,
Dempster proved that when a symmetric, positive-definite completion of $\Sigma$ exists,
then there exists a unique Dempster Completion $\Sigma^\circ$. This completion maximizes
the (differential) entropy

(1.8)    $H(p) = -\int_{\mathbb{R}^n} p(x) \log p(x)\, dx = \tfrac{1}{2} \log(\det \Sigma) + \tfrac{n}{2}\left(1 + \log(2\pi)\right)$

among zero-mean Gaussian distributions having the prescribed elements $\{\sigma_{ij};\ 1 \le
i \le j \le n,\ (i,j) \in \mathcal{I}\}$. Thus, Dempster's Completion $\Sigma^\circ$ solves a maximum entropy
problem, i.e. maximizes entropy under linear constraints.

2. Overture. The long tale of maximum entropy problems originates more than
one hundred and thirty years ago with Boltzmann [8] at the dawn of statistical
mechanics. Since then, several deep thinkers such as Jaynes [BU [BS], Dempster [37],
Csiszár [31], to name but a few, have tried to explain the rationale behind the
maximum entropy approach. Yet, this method, although never presented as a panacea [55,
p. 939], is still often viewed as an idiosyncratic choice. Before we get all tangled up
with "predict states that can be realized by Nature in the greatest number of ways,
while agreeing with your macroscopic information" (Jaynes interpreting Gibbs), apply
the Principle of Parsimony of parametric model fitting (Dempster) or the axiomatic
approach (Csiszár), we hasten to reassure the reader: We are not going to give here even
a précis of the motivation behind maximum entropy problems. Others have done it
much better than we ever could. The scope of this paper is much more modest and
yet, in a way, ambitious.

We want to point out that behind an endless string of maximum entropy solu- 
tions there is a simple geometric principle. Namely, that a whole class of seemingly 
unrelated results concerning probability distributions, spectral densities and covari- 
ance matrices are consequences of the same variational principle. All these problems 
feature linear constraints which determine an affine subspace W in which the solution 
must be sought. Theorem 3.3 (or its generalization Theorem 9.1) simply states that 
the gradient (or a suitable generalization of it) of the entropy functional at a critical 
point must belong to the orthogonal complement (or, more generally, to the annihila- 
tor) of the subspace V of which the affine space W is a translation. Just to avoid any 
misunderstanding: We are not dealing here with the (usually challenging) existence
problem [9, 10, 73, 74]. We simply want to derive in the most economical way the form
of the optimal solutions assuming that they exist.

This orthogonality result is actually a direct consequence of a Lagrange multipliers
argument. Nevertheless, we show that when the constraints are linear, there is no
need to bring in our illustrious compatriot's multipliers, be they vectors, matrices
or signals. One can simply skip the step, use this universal geometric result and
presto! the form of the optimal solution appears. How can we have a geometric
result when probability distributions/densities and spectra naturally belong to the
intersection of suitable cones or simplices with $L^1$ spaces? The reader might look
askance at this approach as, in general, in an infinite dimensional setting, $L^1$ spaces
are not contained in $L^2$ spaces (one exception: absolutely summable sequences are
also square summable). Hence, we simply don't have the Euclidean or Hilbert space
geometry where orthogonality makes sense$^4$. However, in many important maximum
entropy problems, the solution together with an appropriate function of it (inverse,
logarithm, etc.) also belongs to a suitable $L^2$ space (when this is not the case, see
Section 9, a more general Banach space result may be applied). Thus, as we show,
there is nothing to lose in formulating the problem over an appropriate Hilbert space
possibly intersected with a cone or a simplex.

One might wonder at this point: What has this to do with the well known or- 
thogonality principle of linear quadratic optimization? Right on! Theorem 3.3 when 
applied to problems with quadratic criterion, yields well known results such as the 
orthogonality of the estimation error to the subspace generated by the available ran- 
dom variables in linear least-squares estimation. Thus, this orthogonality principle, 
a true deus ex machina, applies equally well to least-squares and entropic variational 
problems with linear constraints. Can this geometric result then be applied to any 
optimization problem in Hilbert space with linear constraints? Answer: No. The 
smoothness of the index functional is indispensable. For instance, the large and
important class of compressed sensing problems [22l EH HQ] EH EU EE EI] features as
criteria $\ell^1$-type norms which do not even admit directional derivatives (they only
admit one-sided directional derivatives as they are convex).

4 This might well be the very reason that our simple observation has not been made before in a
countless number of papers on maximum entropy problems.

The reader might be doubtful by now: Don't the authors of this paper know 
about information geometry, I-projections and the like [25] E21 [88j 1301 EJ EU HI EE1 E 
156] EU |66l EE]? We do and are savvy enough to know that this body of work is of 
central importance in Mathematical Statistics, Information Theory, Signal Processing, 
Identification and Control. Our approach, however, is different. Rather than viewing 
the solution itself of maximum entropy problems as a projection in a suitable geometry 
and then developing a "Pythagorean Theorem for I-divergences" , our result involves 
usual orthogonality in Hilbert space (and the usual Pythagorean Theorem). Only 
that the orthogonality is a property of the differential of the entropy functional which 
does not in general relate to an "error" . In particular, our geometry does not depend 
on the particular entropic criterion employed but only on the Hilbert space in which 
the primal variables live. 

The paper is outlined as follows. In Section 3 we present our basic variational
result. This is then applied in Sections 4 and 5 to various classical and more recent
Burg-type maximum entropy problems and in Section 6 to entropy problems with prior. In
Section 7 we discuss maximum entropy problems on a finite measure space. In Section
8 we develop a new application to block-circulant covariance matrix completion when
an a priori estimate is available. Finally, in Section 9 we give a generalization of our
main result to Banach spaces.

3. Maxima on surfaces. Let $G : \mathbb{R}^3 \to \mathbb{R}$ be a continuously differentiable map
and consider the surface (level set) $S \subset \mathbb{R}^3$ determined by the equation

$$G(x) = c, \qquad c \in \mathbb{R}.$$

Since the derivative of $G$ in the direction of a vector $v$ tangent to the surface $S$ must
be zero, we get $\nabla G \cdot v = 0$. From this follows the well-known fact that the gradient
$\nabla G(x_0)$, $x_0 \in S$, is perpendicular to the plane tangent to the surface $S$ at $x_0$. Let
$F : \mathbb{R}^3 \to \mathbb{R}$ be another smooth functional and suppose that we are interested in
maximizing $F$ over $S$. By the chain rule, at a local maximum point $x_0$, $\nabla F$ must be
orthogonal to the tangent vector of every differentiable curve on $S$ passing through $x_0$. We conclude that, at
a maximum point $x_0$, the gradient of $F$ must also be perpendicular to the plane tangent
to the surface $S$ at $x_0$ and therefore aligned with $\nabla G(x_0)$, cf. e.g. [42, pp. 101-109].
For instance, suppose we want to maximize $F(x, y, z) = x + y + z$ on the surface of the
unit sphere $G(x, y, z) = x^2 + y^2 + z^2 = 1$. Since at a maximum point $\nabla F = (1, 1, 1)^T$
must be proportional to $\nabla G = 2(x, y, z)^T$, we conclude that maxima have equal
components. It follows that the unique maximum point is $x_M = (3^{-1/2}, 3^{-1/2}, 3^{-1/2})$
($x_m = -x_M$ is the unique minimum point).
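The sphere example can be checked numerically. The sketch below, under the assumption that a coarse deterministic grid on the sphere is an adequate stand-in for "all feasible points", verifies that the two gradients are parallel at the candidate maximum (zero cross product) and that no grid point does better. The helper names are ours.

```python
import math


def F(p):
    """Objective F(x, y, z) = x + y + z."""
    x, y, z = p
    return x + y + z


def grad_F(p):
    """Gradient of F, constant and equal to (1, 1, 1)."""
    return (1.0, 1.0, 1.0)


def grad_G(p):
    """Gradient of G(x, y, z) = x^2 + y^2 + z^2, i.e. 2 * (x, y, z)."""
    return tuple(2.0 * c for c in p)


def cross(a, b):
    """Cross product in R^3; it vanishes exactly when a and b are parallel."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


x_max = (3 ** -0.5, 3 ** -0.5, 3 ** -0.5)
alignment_residual = cross(grad_F(x_max), grad_G(x_max))

# Evaluate F on a deterministic latitude/longitude grid of the unit sphere.
grid_values = []
for i in range(1, 60):
    theta = math.pi * i / 60
    for k in range(120):
        phi = 2.0 * math.pi * k / 120
        point = (math.sin(theta) * math.cos(phi),
                 math.sin(theta) * math.sin(phi),
                 math.cos(theta))
        grid_values.append(F(point))
best_on_grid = max(grid_values)

print("cross product of gradients:", alignment_residual)
print("F(x_max) =", F(x_max), " best on grid =", best_on_grid)
```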

The purpose of this paper is to show that a suitable generalization of this basic 
result is sufficient to derive the form of the optimal solution in a variety of maximum 
entropy problems. 

In maximum entropy problems, the map $G$ is actually linear on a suitable vector
space. Hence, $S = \{x : G(x) = c\}$ is an affine space, namely a translation of the subspace
$V = \ker G$. In this case, the geometric principle simply says that, at a maximum
point, $\nabla F$ must be perpendicular to the subspace $\ker G$. We apply this geometric
result to a large class of Burg-entropy and Shannon-entropy [31, 32, 33] variational
problems encompassing, as special cases, Burg's spectral estimation method [11, 12],
Dempster's covariance completion [37] and Gibbs-like variational principles [43]. In
Burg's maximum entropy problems, one maximizes the (integral of the) logarithm of
a positive quantity, be it a probability density, a spectrum or the determinant of a
positive definite matrix, under linear constraints. The latter determine the affine space
$W$. Theorem 3.3 simply says that the Fréchet differential of the entropy functional at
a critical point must belong to the orthogonal complement of the subspace $V$ of which
the affine space $W$ is a translation (coset). In the Burg entropy case, this entails that
the adjoint of the inverse of the solution must belong to $V^\perp$. In the case of Shannon
maximum entropy problems, the orthogonality condition concerns the logarithm of
the solution.

Classical results can then be readily re-derived and generalized. For example,
our result contains the key for the (considerable) recent generalizations developed in
[521 231 S51 S71 |3H]. The case when a prior estimate is available is also covered by
this geometric principle. In the Burg case, the entropic functional turns into a
multivariate Itakura-Saito divergence [31 [50]. In the Shannon case, entropy is replaced by
the Kullback-Leibler divergence (relative entropy) [71]. As an application, we show
how our result can be used to extend the results of [23] to the case when a prior
estimate of the circulant covariance is available. The latter problem deals with
identifying the parameters of a stationary reciprocal process given the first covariance
lags and an a priori covariance estimate.

Let $\mathcal{H}$ be a Hilbert space and let $F : \mathcal{H} \to \mathbb{R}$ be a functional. We say that $F$ is
Gâteaux-differentiable at $h_0$ in direction $v$ if the limit

$$F'(h_0; v) := \lim_{\epsilon \to 0} \frac{F(h_0 + \epsilon v) - F(h_0)}{\epsilon}$$

exists. In this case, $F'(h_0; v)$ is called the directional derivative of $F$ at $h_0$ in direction
$v$. We say that $F$ is Fréchet-differentiable at $h_0$ if there exists an element $DF(h_0)$ in
$\mathcal{H}$ such that

$$\lim_{\|h\|_{\mathcal{H}} \to 0} \frac{\left|F(h_0 + h) - F(h_0) - \langle DF(h_0), h \rangle_{\mathcal{H}}\right|}{\|h\|_{\mathcal{H}}} = 0.$$

The element $DF(h_0)$ is called the Fréchet differential of $F$ at $h_0$. Fréchet
differentiability is stronger than Gâteaux differentiability. In fact, we have the following result
[70, p. 50].

Proposition 3.1. Let $F$ be Fréchet differentiable at $h_0$. Then, $DF(h_0)$ is unique
and, for any $v \in \mathcal{H}$, $F$ is Gâteaux differentiable at $h_0$ in direction $v$ and it holds

(3.1)    $F'(h_0; v) = \langle DF(h_0), v \rangle_{\mathcal{H}}.$

Conversely, when $F$ is Gâteaux differentiable on an open set $\mathcal{U} \subset \mathcal{H}$ and its Gâteaux
derivative is linear and continuous at each point of $\mathcal{U}$, then $F$ is Fréchet differentiable
in $\mathcal{U}$. Finally, when $F$ is convex, if it is Gâteaux differentiable in all directions $v$ then
it is Fréchet differentiable.

In some applications, we cannot expect that the functional be Fréchet
differentiable at the point of interest. We may, however, have that a formula like (3.1) holds
when $v$ varies over a subspace. More precisely, let $V \subset \mathcal{H}$ be a (not necessarily closed)
subspace and $h \in \mathcal{H}$. Consider the corresponding coset $W := h + V$, which is an affine
space over $V$. Observe that, for $w \in W$ and $v \in V$, $(w + \epsilon v) \in W$ for all real $\epsilon$,
namely $w$ is an internal point of $W$ in direction $v$.

Definition 3.2. We say that $w_c$ is a critical point of $F$ over $W = h + V$ if
$F'(w_c; v) = 0$ for all $v \in V$.
Theorem 3.3. Let $W := h + V$ be an affine space. Assume that the functional
$F$ is Gâteaux-differentiable at $w_c \in W$ in any direction $v \in V$ and that the Gâteaux
differential is given by the linear, continuous map $F'(w_c; v) = \langle D_V F(w_c), v \rangle_{\mathcal{H}}$, where
$D_V F(w_c) \in \mathcal{H}$. Then $w_c$ is a critical point of $F$ over $W$ if and only if $D_V F(w_c) \in V^\perp$.
When $F$ is actually Fréchet differentiable at $w_c \in W$, $w_c$ is critical if and only if
$DF(w_c) \in V^\perp$.

Proof. $F'(w_c; v) = 0$ for all $v \in V$ if and only if $\langle D_V F(w_c), v \rangle_{\mathcal{H}} = 0$, $\forall v \in V$. □
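To make the orthogonality condition concrete, here is a small finite-dimensional sketch on $\mathbb{R}^6$ with the Shannon entropy as $F$ and the two linear constraints of the dice problem (total mass one, mean 4.5). It assumes the critical point has the Gibbs form $p_i \propto e^{-\lambda i}$ and checks that the gradient $-(1 + \ln p_i)$ is orthogonal to a basis of $V$, the kernel of the constraint map. The basis `V_BASIS` and all function names are hypothetical choices made for this illustration.

```python
import math

FACES = range(1, 7)


def entropy(p):
    """Shannon entropy F(p) = -sum p_i ln p_i."""
    return -sum(pi * math.log(pi) for pi in p)


def gibbs(lam):
    """Distribution with p_i proportional to exp(-lam * i)."""
    w = [math.exp(-lam * i) for i in FACES]
    z = sum(w)
    return [wi / z for wi in w]


def solve_lam(target_mean, lo=-10.0, hi=10.0):
    """Bisection for the multiplier matching the prescribed mean."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        mean = sum(i * pi for i, pi in zip(FACES, gibbs(mid)))
        if mean > target_mean:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Basis of V = {v : sum v_i = 0, sum i * v_i = 0}: shifted copies of (1, -2, 1).
V_BASIS = [[0.0] * k + [1.0, -2.0, 1.0] + [0.0] * (3 - k) for k in range(4)]


def inner(a, b):
    """Euclidean inner product on R^6."""
    return sum(ai * bi for ai, bi in zip(a, b))


def directional_derivative(F, w, v, eps=1e-6):
    """Central finite-difference approximation of F'(w; v)."""
    plus = [wi + eps * vi for wi, vi in zip(w, v)]
    minus = [wi - eps * vi for wi, vi in zip(w, v)]
    return (F(plus) - F(minus)) / (2.0 * eps)


w_c = gibbs(solve_lam(4.5))
gradient = [-(1.0 + math.log(pi)) for pi in w_c]  # DF(w_c) for the entropy

inner_products = [inner(gradient, v) for v in V_BASIS]
finite_differences = [directional_derivative(entropy, w_c, v) for v in V_BASIS]

# A feasible but non-critical point: the derivative along V no longer vanishes.
w_other = [wi + 0.01 * vi for wi, vi in zip(w_c, V_BASIS[0])]
derivative_elsewhere = directional_derivative(entropy, w_other, V_BASIS[0])

print("inner products with basis of V:", inner_products)
print("derivative at a non-critical feasible point:", derivative_elsewhere)
```

The inner products vanish because $\ln p_i$ is affine in $i$ at the critical point, so the gradient lies in the span of the two constraint vectors, which is exactly $V^\perp$ in this example.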
4. Matricial variational problems. 

4.1. Geometric result. Let $\mathcal{H} = \mathbb{C}^{n \times n}$ (or $\mathcal{H} = \mathbb{R}^{n \times n}$) be the space of $n \times n$
matrices endowed with the inner product $\langle M_1, M_2 \rangle := \mathrm{tr}[M_1^* M_2]$, where $*$ denotes
transposition plus conjugation (we write $M^{-*}$ for $(M^{-1})^*$). The following result was
established in [46].

Lemma 4.1. Let

(4.1)    $F(M) := \log\left|\det[M]\right|.$

If $M$ is nonsingular then, for all $\delta M \in \mathcal{H}$,

(4.2)    $F'(M; \delta M) = \mathrm{tr}\left[M^{-1} \delta M\right] = \langle M^{-*}, \delta M \rangle.$

It now follows from Proposition 3.1 that $F$ is Fréchet differentiable in the open set
of non-singular matrices and

(4.3)    $DF(M) = M^{-*}.$
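The derivative formula of Lemma 4.1 is easy to check numerically in the real $2 \times 2$ case. The following sketch compares a central finite difference of $\log|\det M|$ with the trace expression and with the inner-product form; the matrices `M` and `dM` are arbitrary illustrative values, not taken from the text.

```python
import math


def det2(m):
    """Determinant of a 2 x 2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]


def inv2(m):
    """Inverse of a nonsingular 2 x 2 matrix."""
    d = det2(m)
    return [[m[1][1] / d, -m[0][1] / d],
            [-m[1][0] / d, m[0][0] / d]]


def transpose(m):
    return [[m[j][i] for j in range(2)] for i in range(2)]


def trace_of_product(a, b):
    """tr[a b] = sum over i, k of a[i][k] * b[k][i]."""
    return sum(a[i][k] * b[k][i] for i in range(2) for k in range(2))


def frobenius_inner(a, b):
    """<a, b> = tr[a^T b] = sum of entrywise products (real case)."""
    return sum(a[i][j] * b[i][j] for i in range(2) for j in range(2))


def F(m):
    """F(M) = log |det M|."""
    return math.log(abs(det2(m)))


def numerical_derivative(m, dm, eps=1e-6):
    """Central finite difference of F at m in direction dm."""
    plus = [[m[i][j] + eps * dm[i][j] for j in range(2)] for i in range(2)]
    minus = [[m[i][j] - eps * dm[i][j] for j in range(2)] for i in range(2)]
    return (F(plus) - F(minus)) / (2.0 * eps)


M = [[2.0, 1.0], [-1.0, 3.0]]    # det = 7
dM = [[0.5, -1.0], [2.0, 0.25]]  # arbitrary direction

analytic = trace_of_product(inv2(M), dM)           # tr[M^{-1} dM]
as_inner = frobenius_inner(transpose(inv2(M)), dM)  # <M^{-T}, dM>
numeric = numerical_derivative(M, dM)

print(analytic, as_inner, numeric)
```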



We are interested in extremizing (4.1) over an affine space, namely a coset of the
form $W = A + V$, where $A \in \mathcal{H}$ and $V$ is a subspace of $\mathcal{H}$.

Theorem 4.2. Let $W = A + V$ be an affine space. Then a nonsingular matrix
$M_c \in W$ extremizes $F(M) = \log\left|\det[M]\right|$ over $W$ if and only if $M_c^{-*} \in V^\perp$.

Proof. Let $M_c \in W$ be non-singular. By (4.3), we have $DF(M_c) = M_c^{-*}$. The
conclusion now follows from Theorem 3.3. □



4.2. Dempster's covariance selection. In various applications, index (4.1)
must be extremized (or rather maximized) on the intersection between an affine space
$W$ and a convex cone. A typical example is that of the cone of positive semidefinite
matrices. This is the case considered by Dempster in the seminal paper [37] where a
general strategy for completing a partially specified covariance matrix was introduced.
We now show that Theorem 4.2 provides a geometrical interpretation of one of the key
features of Dempster's result. To see this, consider Dempster's problem with the
same notation as in Subsection 1.4. Let $W$ be the affine space of symmetric matrices
having the prescribed elements $\{\sigma_{ij};\ 1 \le i \le j \le n,\ (i,j) \in \mathcal{I}\}$. Notice that $W$ is affine over the
subspace $V$ of symmetric matrices having zeros in the positions $\mathcal{I}$. Observe next that
the solution $\Sigma$ is constrained to be in the intersection between $W$ and the convex
cone of positive definite matrices. On this set, maximizing the index (4.1) or the
entropy (1.8) is equivalent. Thus, the two criteria yield the same solution. Moreover,
$(i,i) \in \mathcal{I}$ for all $i = 1, \ldots, n$, i.e. the diagonal elements $\sigma_{ii}$ are all fixed, so that
$|\sigma_{ij}| \le \sqrt{\sigma_{ii}\sigma_{jj}}$ and
hence the feasible set is bounded. Finally, as $\Sigma$ tends to be singular, i.e. it approaches
the boundary of the cone, $H(p)$ tends to $-\infty$, which implies that the solution can be
searched among positive definite matrices. Thus, under the feasibility assumption, the
optimal solution exists and lies in the interior of the cone. We can then repeat locally
the argument of Theorem 4.2 to conclude that the maximum entropy completion $\Sigma_c$
is such that $\Sigma_c^{-1} \in V^\perp$. Finally, observe that $V^\perp$ is the space of matrices having zeros
in $\bar{\mathcal{I}}$, the complement of $\mathcal{I}$. Indeed, let $e_i$ denote the $i$-th canonical vector in $\mathbb{R}^n$ and
observe that for $(i,j) \in \bar{\mathcal{I}}$, the rank one matrix $e_i e_j^T$ belongs to $V$. If $M \in V^\perp$, we
must have

$$0 = \mathrm{tr}\left[(e_i e_j^T)^T M\right] = \mathrm{tr}\left[e_j e_i^T M\right] = e_i^T M e_j = [M]_{ij}, \qquad \forall (i,j) \in \bar{\mathcal{I}}.$$

Thus, the maximum entropy completion $\Sigma_c$ is a Dempster completion.
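A $3 \times 3$ toy case may help fix ideas. In the sketch below the diagonal and the entries $(1,2)$, $(2,3)$ are prescribed, while the entry $(1,3)$ is free; all numerical values are hypothetical. For this pattern the choice $\sigma_{13} = \sigma_{12}\sigma_{23}/\sigma_{22}$ makes the $(1,3)$ entry of the inverse vanish, and a brute-force scan over the free entry (our own crude check, not a method from the text) confirms that the same value maximizes the determinant.

```python
def det3(m):
    """Determinant of a 3 x 3 matrix by cofactor expansion along row 1."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


# Prescribed entries (hypothetical values); the (1,3) entry is unspecified.
S11 = S22 = S33 = 1.0
S12, S23 = 0.5, 0.4


def completion(x):
    """Symmetric completion with the free entry sigma_13 = x."""
    return [[S11, S12, x],
            [S12, S22, S23],
            [x, S23, S33]]


def inverse_entry_13(m):
    """(1,3) entry of the inverse: cofactor of entry (3,1) over det."""
    cofactor_31 = m[0][1] * m[1][2] - m[0][2] * m[1][1]
    return cofactor_31 / det3(m)


# Candidate Dempster completion for this pattern.
x_dempster = S12 * S23 / S22
sigma_dempster = completion(x_dempster)

# Brute-force maximization of det (equivalently of log det) over the free entry.
candidates = [-0.5 + 0.001 * k for k in range(1401)]
x_grid = max(candidates, key=lambda x: det3(completion(x)))

print("sigma_13 =", x_dempster)
print("[inverse]_13 =", inverse_entry_13(sigma_dempster))
print("grid maximizer of det:", x_grid)
```

Here the determinant is the concave quadratic $0.59 + 0.4x - x^2$ in the free entry $x$, so its maximizer $x = 0.2$ coincides with the value that zeroes the corresponding entry of the inverse.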

4.3. General matrix completion. In [46], Dempster's completions were shown
to solve suitable entropy-like variational problems for general nonsingular matrices$^5$.
Again, the form of the extremal completions (no uniqueness is guaranteed there) when
they exist is provided by Theorem 4.2.
5. Matricial functions. 

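5.1. The orthogonality result. Consider now the Hilbert space $\mathcal{H}$ of square
integrable functions taking values in the space of $m \times m$ Hermitian matrices. We
denote by $\mathbb{H}_n$ the $n^2$-dimensional, real vector space of Hermitian matrices of dimension
$n \times n$. Hence, $\mathcal{H} = L^2(\mathbb{T}, \mathbb{H}_m)$ with scalar product

$$\langle \Phi, \Psi \rangle_{\mathcal{H}} := \frac{1}{2\pi} \int_{-\pi}^{\pi} \mathrm{tr}\left[\Phi(e^{j\vartheta}) \Psi(e^{j\vartheta})\right] d\vartheta.$$

Consider the functional

(5.1)    $F(\Phi) = \dfrac{1}{2\pi} \int_{-\pi}^{\pi} \log\left|\det[\Phi(e^{j\vartheta})]\right| d\vartheta.$

Lemma 5.1. Suppose $\Phi \in L^\infty(\mathbb{T}, \mathbb{H}_n)$ is coercive$^6$. Then, for any $\delta\Phi \in L^\infty(\mathbb{T}, \mathbb{H}_n)$,
the directional derivative of (5.1) exists and is given by the linear map

(5.2)    $F'(\Phi; \delta\Phi) = \dfrac{1}{2\pi} \int_{-\pi}^{\pi} \mathrm{tr}\left[\Phi^{-1}(e^{j\vartheta})\, \delta\Phi(e^{j\vartheta})\right] d\vartheta = \langle \Phi^{-1}, \delta\Phi \rangle_{\mathcal{H}}.$

Proof. Observe that, for $\delta\Phi \in L^\infty(\mathbb{T}, \mathbb{H}_n)$ and $|\epsilon|$ sufficiently small, $\Phi(e^{j\vartheta}) +
\epsilon\, \delta\Phi(e^{j\vartheta})$ is a.e. positive definite. After bringing the derivative under the integral
sign, we can use Lemma 4.1 for almost all $\vartheta$. □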



Let $W = A + V$ be an affine space in $L^\infty(\mathbb{T}, \mathbb{H}_n)$, namely $A \in L^\infty(\mathbb{T}, \mathbb{H}_n)$ and $V$
is a subspace of $L^\infty(\mathbb{T}, \mathbb{H}_n)$.
Then Theorem 3.3 yields:

Theorem 5.2. Let $W = A + V$ be as above and $\Phi_c \in W$ be coercive. Then,
if $\Phi_c$ is a critical point of (5.1) over $W$, we have $\Phi_c^{-1} \in V^\perp$.$^7$

Proof. By Lemma 5.1, if such a $\Phi_c$ extremizes (5.1), then $\langle \Phi_c^{-1}, v \rangle_{\mathcal{H}} = 0$ for all
$v \in V$. Namely, $\Phi_c^{-1} \in V^\perp$. □


5 Actually, the case of full-rank rectangular matrices, with the Moore-Penrose pseudoinverse in
place of the inverse, was also treated in [46].

6 $\Phi$ is called coercive if $\exists\, \alpha > 0$ s.t. $\Phi(e^{j\vartheta}) - \alpha I_m$ is a.e. positive definite on $\mathbb{T}$.

7 For a not necessarily closed subspace $V$ of $L^2(\mathbb{T}, \mathbb{H}_n)$, the orthogonal complement $V^\perp$ is the
closed subspace of $u \in L^2(\mathbb{T}, \mathbb{H}_n)$ such that $\langle u, v \rangle_{L^2} = 0$, $\forall v \in V$.
5.2. Burg's maximum entropy covariance extension. In his seminal work
[11, 12], Burg introduced a spectral estimation method based on the maximization of
entropy which is widely used in signal processing. We now show that Theorem 5.2
provides a most transparent reason why the solution has to be an AR process. Consider
a discrete-time Gaussian process $\{y_k;\ k \in \mathbb{Z}\}$ taking values in $\mathbb{R}^m$. Let $Y_{[-n,n]}$ be the
random vector obtained by considering the window $(y_{-n}, y_{-n+1}, \ldots, y_0, \ldots, y_{n-1}, y_n)$
and let $p_{Y_{[-n,n]}}$ denote the corresponding joint density. The (differential) entropy rate
of $y$ is defined by

(5.3)    $h_r(y) := \lim_{n \to \infty} \dfrac{1}{2n+1}\, H\left(p_{Y_{[-n,n]}}\right),$

if the limit exists, where $H(p_{Y_{[-n,n]}})$, cf. (1.8), denotes the entropy of the density of the random
vector $Y_{[-n,n]}$. Kolmogorov established the following important
result.

Theorem 5.3. Let $y = \{y_k;\ k \in \mathbb{Z}\}$ be an $\mathbb{R}^m$-valued, zero-mean, Gaussian,
stationary, purely nondeterministic of full rank process with spectral density $\Phi_y$. Then

(5.4)    $h_r(y) = \dfrac{m}{2} \log(2\pi e) + \dfrac{1}{4\pi} \int_{-\pi}^{\pi} \log\det \Phi_y(e^{j\vartheta})\, d\vartheta.$

As is well-known, there is also a fundamental connection between the quantity
appearing in (5.4) and the optimal one-step-ahead predictor: The multivariate
Szegő-Kolmogorov formula reads

(5.5)    $\det R = \exp\left\{\dfrac{1}{2\pi} \int_{-\pi}^{\pi} \log\det \Phi_y(e^{j\vartheta})\, d\vartheta\right\},$

where $R$ is the error covariance matrix corresponding to the optimal predictor.
Consider now the multivariate covariance extension problem. Let $C_k$, $k = 0, 1, \ldots, n-1$,
of dimension $m \times m$ be some estimated covariance lags of an unknown stationary
process $y$. Then Burg's problem consists in finding a stationary process $y$ with spectral
density $\Phi_y$ which maximizes the index

(5.6)    $\dfrac{1}{2\pi} \int_{-\pi}^{\pi} \log\det \Phi_y(e^{j\vartheta})\, d\vartheta$

among all spectral densities having as first $n$ Fourier coefficients $C_k$, $k = 0, 1, \ldots, n-1$.



In view of Kolmogorov's result (5.4), maximizing the entropy rate of a stationary
Gaussian process is equivalent to maximizing the integral of $\log\det \Phi_y$. Assume that
the block-Toeplitz matrix

(5.7)    $\Sigma_n = \begin{bmatrix} C_0 & C_1 & \cdots & C_{n-1} \\ C_1^\top & C_0 & \cdots & C_{n-2} \\ \vdots & \vdots & \ddots & \vdots \\ C_{n-1}^\top & C_{n-2}^\top & \cdots & C_0 \end{bmatrix}$

is positive definite. Then there are infinitely many spectra having the prescribed
Fourier coefficients.

8 Actually, the solution to this problem maximizes the entropy rate in the larger class of
second-order processes [26].
Consider now the matrix pseudo-polynomial

$$P(e^{j\vartheta}) = \sum_{k=-n+1}^{n-1} C_k\, e^{-j\vartheta k}$$

with $C_{-k} = C_k^\top$, and define the subspace $V_n$ of $L^\infty(\mathbb{T}, \mathbb{H}_n)$ of functions whose Fourier
coefficients $R_i$ vanish for all $i = -n+1, \ldots, n-1$ and obey the symmetry constraint
$R_i = R_{-i}^*$. Then the constraint in Burg's problem can be expressed as $\Phi \in W \cap S$,
where the affine space $W$ is defined by

$$W = P + V_n$$

and $S$ is the convex cone of bounded, coercive spectral densities. On $S$, (5.1) and (5.6)
coincide, and $F$ is strictly concave. Thus, an extremizer $\Phi_c$ is actually a maximum
point. By Theorem 5.2, this maximum point $\Phi_c$ is such that $\Phi_c^{-1} \in V_n^\perp$. Observe now
that $V_n^\perp$ is given by the matricial polynomials of the form

$$\sum_{k=-n+1}^{n-1} \Lambda_k\, e^{-j\vartheta k}.$$

We conclude that the optimal spectrum has the form

(5.8)    $\Phi_c(e^{j\vartheta}) = \left[\sum_{k=-n+1}^{n-1} \Lambda_k^\circ\, e^{-j\vartheta k}\right]^{-1}, \qquad \Lambda_{-k}^\circ = (\Lambda_k^\circ)^*,$

for some matrices $\Lambda_k^\circ$, $k = -n+1, \ldots, 0, \ldots, n-1$, which allow the constraints
on the first $n$ coefficients to be satisfied. Thus, the solution process is an AR process. If only
some of the $C_k$, $k = 0, 1, \ldots, n-1$, are available, the classical approach to the problem
requires a certain effort and some ad hoc reasoning to get the solution form. Theorem
5.2, on the contrary, yields immediately that in (5.8) $\Lambda_k^\circ = 0$ for all $k$ corresponding
to missing $C_k$'s.
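The scalar case with two prescribed lags gives a quick sanity check of this conclusion. The sketch below assumes $m = 1$, hypothetical lags $c_0 = 1$, $c_1 = 0.4$, and the AR(1) spectrum $\sigma^2 / |1 - a e^{-j\vartheta}|^2$ with $a = c_1/c_0$ and $\sigma^2 = c_0(1 - a^2)$; integrals over the circle are approximated by uniform Riemann sums. It checks that this spectrum matches the two lags, that its inverse has no Fourier coefficients beyond lag 1, and that its entropy index exceeds that of the competitor obtained by setting all higher lags to zero.

```python
import math

C0, C1 = 1.0, 0.4              # hypothetical covariance lags c_0, c_1
A = C1 / C0                    # AR(1) coefficient
SIGMA2 = C0 * (1.0 - A * A)    # innovation variance
N = 4096                       # number of grid points on the unit circle
THETAS = [2.0 * math.pi * i / N for i in range(N)]


def burg_spectrum(theta):
    """AR(1) spectrum sigma^2 / |1 - a e^{-j theta}|^2."""
    return SIGMA2 / (1.0 + A * A - 2.0 * A * math.cos(theta))


def inverse_spectrum(theta):
    """Reciprocal of the AR(1) spectrum, a trigonometric polynomial of degree 1."""
    return 1.0 / burg_spectrum(theta)


def truncated_spectrum(theta):
    """Competitor with the same c_0, c_1 and all higher lags set to zero."""
    return C0 + 2.0 * C1 * math.cos(theta)


def fourier_coefficient(f, k):
    """(1/2pi) * integral of f(theta) cos(k theta), for an even function f."""
    return sum(f(t) * math.cos(k * t) for t in THETAS) / N


def entropy_index(f):
    """(1/2pi) * integral of log f(theta) over the circle."""
    return sum(math.log(f(t)) for t in THETAS) / N


print("lags of AR(1) spectrum:",
      fourier_coefficient(burg_spectrum, 0), fourier_coefficient(burg_spectrum, 1))
print("lag-2 coefficient of inverse:", fourier_coefficient(inverse_spectrum, 2))
print("entropy index, AR(1) vs truncated:",
      entropy_index(burg_spectrum), entropy_index(truncated_spectrum))
```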

5.3. A more general moment problem. We consider next a generalization
of Burg's problem studied by Byrnes, Georgiou and Lindquist and co-workers [161
HU [17\ [50l [53l [571 [5TJ [TBI 02] in the frame of generalized moment problems. In
their broad research effort, having applications, besides spectral estimation, to robust
control problems, elements of a parametric family of rational spectral densities were
recognized from the start [16, 15] to be critical points of logarithmic entropy-like
functionals.

Consider a transfer function

(5.9)    $G(z) = (zI - A)^{-1} B, \qquad A \in \mathbb{C}^{n \times n},\ B \in \mathbb{C}^{n \times m},\ n > m,$

where $A$ has all its eigenvalues in the open unit disk, $B$ has full column rank, and
$(A, B)$ is a reachable pair$^9$. Suppose $G(z)$ models a bank of filters fed by a wide sense
stationary, purely nondeterministic, $\mathbb{C}^m$-valued process $y$:

    y(t) --> [ G(z) ] --> x(t)


9 A pair $(A, B)$ is called reachable in Systems Theory [57] if the matrix $[B \mid AB \mid \ldots \mid A^{n-1}B]$
has full row rank.
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Let x be the n-dimensional stationary output process 



(5.10) 



x k +i = Ax k + By k , k e Z. 



We denote by E the covariance of x k ■ The spectrum $ must then satisfy the following 
moment constraint 



(5.11) 



1 

2^ 



G(e jl5 )$(e J *)G*(e J,> )cM 



As in [17, 57, 51, 45, 48 , we now consider the problem of determining spectral densities 
<!> satisfying (5.111 for a given E > 0. The covariance extension is a special case 
of this problem corresponding to G(z) := [z~ n I | z~ n+1 I \ . . . \ z _1 /] T and E 
equal to the Toeplitz matrix in (5.7). More details on this fact may be found in [57] 



where other classical problems are shown to be special cases of the above. The most 
important of these problems is the celebrated Nevanlinna-Pick interpolation problem 
of fundamental importance in various H°° control problems [HJ [TBI [3 [5H] . 
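
To make the moment constraint (5.11) concrete, the following sketch evaluates its left-hand side numerically for a white-noise input ($\Phi \equiv 1$) and compares it with the state covariance of (5.10), obtained by iterating the discrete Lyapunov recursion $\Sigma \leftarrow A \Sigma A^T + B B^T$. The matrices, the grid size and all function names are hypothetical choices made only for illustration ($n = 2$, $m = 1$, real data).

```python
import cmath
import math

# Hypothetical example data: n = 2 states, m = 1 input, A stable, (A, B) reachable.
A = [[0.5, 0.2], [0.0, -0.3]]
B = [1.0, 1.0]

def G(z):
    """G(z) = (zI - A)^{-1} B for the 2 x 2 matrix A, via the adjugate formula."""
    m11, m12 = z - A[0][0], -A[0][1]
    m21, m22 = -A[1][0], z - A[1][1]
    det = m11 * m22 - m12 * m21
    return [(m22 * B[0] - m12 * B[1]) / det, (-m21 * B[0] + m11 * B[1]) / det]

def moment_integral(phi, grid=2048):
    """Approximate (1/2pi) * integral of G phi G^* over [-pi, pi) on a uniform grid."""
    S = [[0j, 0j], [0j, 0j]]
    for i in range(grid):
        theta = -math.pi + 2.0 * math.pi * i / grid
        g = G(cmath.exp(1j * theta))
        w = phi(theta) / grid
        for r in range(2):
            for c in range(2):
                S[r][c] += g[r] * w * g[c].conjugate()
    return S

def lyapunov_fixed_point(iterations=200):
    """Iterate S <- A S A^T + B B^T, which converges because A is stable."""
    S = [[0.0, 0.0], [0.0, 0.0]]
    for _ in range(iterations):
        AS = [[sum(A[r][k] * S[k][c] for k in range(2)) for c in range(2)]
              for r in range(2)]
        S = [[sum(AS[r][k] * A[c][k] for k in range(2)) + B[r] * B[c]
              for c in range(2)] for r in range(2)]
    return S

sigma_integral = moment_integral(lambda theta: 1.0)   # white-noise input
sigma_lyapunov = lyapunov_fixed_point()
max_gap = max(abs(sigma_integral[r][c] - sigma_lyapunov[r][c])
              for r in range(2) for c in range(2))
print(max_gap)
```

For a non-constant $\Phi$ only the argument passed to `moment_integral` changes; the comparison with the Lyapunov recursion is specific to the white-noise case.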

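We now show how to treat this problem in our geometric framework. Let, as
before, $\mathcal{H} = L^2(\mathbb{T}, \mathbb{H}_m)$. Consider now the linear operator

(5.12)   $$\Gamma : L^\infty(\mathbb{T}, \mathbb{H}_m) \to \mathbb{H}_n, \qquad \Phi \mapsto \frac{1}{2\pi} \int_{-\pi}^{\pi} G(e^{j\vartheta}) \Phi(e^{j\vartheta}) G^*(e^{j\vartheta})\, d\vartheta.$$

It follows that for the constraint (5.11) to be feasible, $\Sigma$ must belong to the linear
space

(5.13)   $$\mathrm{Range}\, \Gamma := \Big\{ M \in \mathbb{H}_n \;\Big|\; \exists\, \Phi \in L^\infty(\mathbb{T}, \mathbb{H}_m) \text{ such that } \frac{1}{2\pi} \int_{-\pi}^{\pi} G(e^{j\vartheta}) \Phi(e^{j\vartheta}) G^*(e^{j\vartheta})\, d\vartheta = M \Big\}.$$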

Consider now the following generalization of Burg's problem: Maximize the entropy
index (5.6) subject to (5.11), where $\Sigma$ is assumed to be positive definite. Suppose that
(5.11) is feasible, namely there exists a spectral density $\Phi_0 \in L^\infty(\mathbb{T}, \mathbb{H}_m)$ satisfying
this constraint. Then, the family $W$ of Hermitian-valued functions satisfying (5.11)
may be expressed as

$$W = \Phi_0 + V,$$

where $V = \{\Phi \in L^\infty(\mathbb{T}, \mathbb{H}_m) \mid \int G \Phi G^* = 0\}$. In other words, $V = \ker \Gamma$. The constraint
in the generalized Burg problem can be expressed as $\Phi \in W \cap S$, where $S$ is
the convex cone of bounded, coercive spectral densities. Since



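$$\Big\langle \int G \Phi G^* \frac{d\vartheta}{2\pi},\, M \Big\rangle_{\mathbb{H}_n} := \mathrm{tr} \Big[ \int G \Phi G^* \frac{d\vartheta}{2\pi}\, M \Big] = \mathrm{tr} \int \Phi\, G^* M G\, \frac{d\vartheta}{2\pi} = \langle \Phi,\, G^* M G \rangle_{\mathcal{H}},$$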



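we have that the adjoint of $\Gamma$, mapping $\mathbb{H}_n$ to $L^\infty(\mathbb{T}, \mathbb{H}_m)$, is given by

(5.14)   $$\Gamma^* : M \mapsto G^* M G.$$

In particular, $\mathrm{Range}\, \Gamma^* = \{\Phi = G^* M G,\ M \in \mathbb{H}_n\} \subset C(\mathbb{T}, \mathbb{H}_m)$, the continuous
Hermitian-valued functions on the unit circle. Since $\mathrm{Range}\, \Gamma^*$ is finite-dimensional, it
is necessarily closed and we have

(5.15)   $$[\ker \Gamma]^\perp = \mathrm{Range}\, \Gamma^* = \{\Phi = G^* M G,\ M \in \mathbb{H}_n\}.$$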
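
As a consistency check, here is a sketch in the scalar case $m = 1$ for the covariance extension choice $G(z) = [z^{-n}, \ldots, z^{-1}]^T$. The entries $M_{il}$ of $M$ and the coefficients $\Lambda_k$ are notation introduced here only for illustration. On the unit circle the $i$-th entry of $G$ is $e^{-j(n-i+1)\vartheta}$, so that

```latex
\begin{aligned}
G^*(e^{j\vartheta})\, M\, G(e^{j\vartheta})
  &= \sum_{i=1}^{n} \sum_{l=1}^{n} e^{\,j(n-i+1)\vartheta}\, M_{il}\, e^{-j(n-l+1)\vartheta}
   = \sum_{i,l=1}^{n} M_{il}\, e^{-j(i-l)\vartheta} \\
  &= \sum_{k=-n+1}^{n-1} \Lambda_k\, e^{-j\vartheta k},
  \qquad \Lambda_k := \sum_{i-l=k} M_{il}.
\end{aligned}
```

Since $M$ is Hermitian, $\Lambda_{-k} = \Lambda_k^*$. In this special case the elements of $\mathrm{Range}\, \Gamma^*$ are therefore trigonometric polynomials of degree at most $n-1$, in agreement with the description of $V^\perp$ in Burg's problem.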
By Theorem 5.2 the maximum point $\Phi_c$ is such that $\Phi_c^{-1} \in V^\perp$. Hence, the optimal
spectrum has the form

(5.16)   $$\Phi_c(e^{j\vartheta}) = \big[ G(e^{j\vartheta})^* \Lambda_c\, G(e^{j\vartheta}) \big]^{-1},$$

for some Hermitian $\Lambda_c$ such that $G(e^{j\vartheta})^* \Lambda_c G(e^{j\vartheta}) > 0$ on $\mathbb{T}$ and the constraint (5.11)
is satisfied, namely

$$\int G \big[ G^* \Lambda_c G \big]^{-1} G^* \frac{d\vartheta}{2\pi} = \Sigma.$$

Indeed, Georgiou showed in [52] that the unique solution of the generalized Burg
problem has the form (5.16) with

(5.17)   $$\Lambda_c = \cdots$$



6. Variational entropy problems with "prior". 

6.1. Matricial problems. Consider now the same set up as in Section 4, where
a "prior" nonsingular estimate $N$ of the matrix $M$ is available. Rather than extremizing
(maximizing) (4.1), we now consider the problem of finding a matrix that belongs
to the given affine set $W$ and extremizes the index

(6.1)   $$F(M) := \log |\det [N]| - \log |\det [M]| + \mathrm{tr}\, (N^{-1} M)$$

(see below for insights and motivation for this choice). Lemma 4.1 now becomes:

LEMMA 6.1. Let $F(M)$ be given by (6.1). If $M$ is nonsingular then, for any
$\delta M \in \mathcal{H} = \mathbb{C}^{n \times n}$,

(6.2)   $$F'(M; \delta M) = \mathrm{tr}\, [(-M^{-1} + N^{-1})\, \delta M],$$

and $DF(M) = -M^{-*} + N^{-*}$.

By Theorem 3.3 we get

Theorem 6.2. Let $W = A + V$ be an affine set in $\mathcal{H} = \mathbb{C}^{n \times n}$. Let $N$ be a
nonsingular matrix in $\mathcal{H}$. Then the nonsingular matrix $M_c \in W$ extremizes (6.1)
over $W$ if and only if $(M_c^{-*} - N^{-*}) \in V^\perp$.
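
The condition of Theorem 6.2 can be checked numerically on a toy problem. The sketch below restricts attention to real symmetric $2 \times 2$ matrices with the trace inner product (an assumption made here for simplicity). The diagonal of $M$ is prescribed, so $V$ consists of the symmetric matrices with zero diagonal and $V^\perp$ of the diagonal matrices; the theorem then predicts that $M_c^{-1}$ and $N^{-1}$ share the same off-diagonal entry. The prior $N$, the prescribed diagonal $(1, 3)$ and the function names are hypothetical, and the index is minimized directly by a ternary search rather than through the orthogonality condition.

```python
import math

def inv2(M):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def index_F(M, N):
    """Index (6.1): log|det N| - log|det M| + tr(N^{-1} M) for 2 x 2 matrices."""
    P = inv2(N)
    det_m = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    det_n = N[0][0] * N[1][1] - N[0][1] * N[1][0]
    trace = sum(P[i][k] * M[k][i] for i in range(2) for k in range(2))
    return math.log(abs(det_n)) - math.log(abs(det_m)) + trace

def complete_with_prior(s1, s2, N, iterations=200):
    """Minimize (6.1) over M = [[s1, x], [x, s2]] by ternary search on x."""
    bound = math.sqrt(s1 * s2)
    lo, hi = -bound, bound            # M is positive definite strictly inside
    for _ in range(iterations):
        x1 = lo + (hi - lo) / 3.0
        x2 = hi - (hi - lo) / 3.0
        if index_F([[s1, x1], [x1, s2]], N) < index_F([[s1, x2], [x2, s2]], N):
            hi = x2
        else:
            lo = x1
    x = 0.5 * (lo + hi)
    return [[s1, x], [x, s2]]

N = [[2.0, 0.5], [0.5, 1.0]]              # hypothetical prior
M_c = complete_with_prior(1.0, 3.0, N)    # diagonal of M fixed to (1, 3)
gap = inv2(M_c)[0][1] - inv2(N)[0][1]     # off-diagonal entry of M_c^{-1} - N^{-1}
print(M_c[0][1], gap)
```

The printed gap is zero up to the accuracy of the search, i.e. $M_c^{-1} - N^{-1}$ is diagonal, as the orthogonality condition requires.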



In order to motivate the choice (6.1), we first recall a few basic facts on entropy for
Gaussian random vectors and processes that may be found e.g. in [80], [63], [27].
The relative entropy or Kullback-Leibler pseudo-distance or divergence between two
probability densities $p$ and $q$, with the support of $p$ contained in the support of $q$, is
defined by

(6.3)   $$\mathbb{D}(p \| q) := \int_{\mathbb{R}^n} p(x) \log \frac{p(x)}{q(x)}\, dx,$$

see e.g. [27]. In the case of two zero-mean Gaussian densities $p$ and $q$ with positive
definite covariance matrices $M$ and $N$, respectively, the relative entropy is given by:

(6.4)   $$\mathbb{D}(p \| q) = \frac{1}{2} \big[ \log \det (M^{-1} N) + \mathrm{tr}\, (N^{-1} M) - n \big].$$
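
Comparing the Gaussian relative entropy with the index (6.1) term by term makes the link explicit. As a short sketch, for positive definite $M$ and $N$ of size $n \times n$ the absolute values in (6.1) can be dropped, and

```latex
F(M) = \log\det N - \log\det M + \mathrm{tr}\,(N^{-1}M)
     = \log\det(M^{-1}N) + \mathrm{tr}\,(N^{-1}M)
     = 2\,\mathbb{D}(p\|q) + n .
```

The two indices thus differ only by the factor $2$ and the additive constant $n$.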



Hence, when $N$ and $M$ are positive definite, minimizing index (6.1) is indeed equivalent
to minimizing the Kullback-Leibler divergence between two Gaussian random
vectors, which is one of the central problems in statistical modeling. Indeed, as is well
known, (6.4) originates from maximum likelihood considerations, cf. e.g. [13, Section
II]. An important application of this result is the estimation of a structured covariance
matrix. In this class of problems we need to estimate a covariance matrix $\Sigma$ in such
a way that it satisfies some linear constraints. The sample covariance estimate $\hat{\Sigma}$ will
normally fail to satisfy the given linear constraints, so that the problem of computing
a $\Sigma_c$ that satisfies the constraints and is as close as possible to $\hat{\Sigma}$ arises naturally, see
[T51 IM1 H71 175] for more details and applications. In particular, in [37] the constraint
is given by $\Sigma \in \mathrm{Range}\, \Gamma$ (as defined in (5.13)). It was shown there (Proposition 3.2)
that

(6.5)   $$V = \{\Sigma : (I - \Pi_B)(\Sigma - A \Sigma A^*)(I - \Pi_B) = 0\},$$

with $\Pi_B$ being the orthogonal projection onto $\mathrm{im}\,(B)$, so that it is easy to see that

(6.6)   $$V^\perp = \{\Delta = (I - \Pi_B) \Lambda (I - \Pi_B) - A^* (I - \Pi_B) \Lambda (I - \Pi_B) A : \Lambda \in \mathbb{H}_n\}.$$

Then, Theorem 6.2 can be used to get in a straightforward manner the form of the
optimal $\Sigma_c$ presented in [47, Section IV]:

(6.7)   $$\Sigma_c = \big( \hat{\Sigma}^{-1} + (I - \Pi_B) \Lambda (I - \Pi_B) - A^* (I - \Pi_B) \Lambda (I - \Pi_B) A \big)^{-1}, \qquad \Lambda \in \mathbb{H}_n.$$

6.2. Matricial functions problems with "prior". Like Theorem 4.2,
Theorem 6.2 may also be generalized to the case when $\mathcal{H} = L^2(\mathbb{T}, \mathbb{H}_n)$. In this setting,
we consider $\Phi \in L^\infty(\mathbb{T}, \mathbb{H}_n)$ coercive and a given "prior" $\Psi$, also essentially bounded
and coercive. The index to be extremized is

(6.8)   $$F(\Phi) = \frac{1}{2\pi} \int_{-\pi}^{\pi} \big\{ \log(\det \Psi) - \log(\det \Phi) + \mathrm{tr}\, [\Psi^{-1} \Phi] \big\}\, d\vartheta.$$

Motivation for considering this index will be provided after the statement of the next
result. A straightforward generalization of Lemma 6.1 and Theorem 3.3 now gives a
result germane to Theorem 6.2.

Theorem 6.3. Let $\mathcal{H}$ be as before $L^2(\mathbb{T}, \mathbb{H}_n)$, let $W = A + V$ be an affine set
in $L^\infty(\mathbb{T}, \mathbb{H}_n)$ and let $\Phi_c(e^{j\vartheta}) \in W$ be coercive. Then $\Phi_c$ extremizes (6.8) over $W$ if and
only if $(\Phi_c^{-1} - \Psi^{-1}) \in V^\perp$.



To provide some motivation and insight for index (6.8), we consider two zero-mean,
jointly Gaussian, stationary, purely nondeterministic processes $y = \{y_k;\ k \in \mathbb{Z}\}$ and
$z = \{z_k;\ k \in \mathbb{Z}\}$ taking values in $\mathbb{R}^m$. We consider the relative entropy rate $\mathbb{D}_r(y \| z)$
between $y$ and $z$ defined as

(6.9)   $$\mathbb{D}_r(y \| z) := \lim_{n \to \infty} \frac{1}{2n+1}\, \mathbb{D}\big( p_{Y_{[-n,n]}} \,\big\|\, p_{Z_{[-n,n]}} \big) \quad \text{if the limit exists,}$$

where $p_{Y_{[-n,n]}}$ and $p_{Z_{[-n,n]}}$ are the densities of the random vectors obtained from $y$
and $z$, respectively, by considering the "windows" from time $-n$ to time $n$. Following
in his mentor's footsteps, the great information theorist M. Pinsker [80] proved the
following important result (see also [87], [63], [77]):

Theorem 6.4. Let $y = \{y_k;\ k \in \mathbb{Z}\}$ and $z = \{z_k;\ k \in \mathbb{Z}\}$ be $\mathbb{R}^m$-valued, zero-mean,
Gaussian, stationary, purely nondeterministic processes with spectral density
functions $\Phi_y$ and $\Phi_z$, respectively. Assume, moreover, that at least one of the following
conditions is satisfied:

1. $\Phi_y \Phi_z^{-1}$ is bounded;

2. $\Phi_y \in L^2(-\pi, \pi)$ and $\Phi_z$ is coercive.

Then

(6.10)   $$\mathbb{D}_r(y \| z) = \frac{1}{4\pi} \int_{-\pi}^{\pi} \big\{ \log \det (\Phi_y^{-1} \Phi_z) + \mathrm{tr}\, [\Phi_z^{-1} (\Phi_y - \Phi_z)] \big\}\, d\vartheta.$$



The index (6.10) has the form of a multivariate Itakura-Saito divergence of speech
processing [501 H] and is basically the same as (6.8). Indeed, one of the main results of
[48] is based on the minimization of (6.8), where $\Psi$ is a given "prior" spectral density
and $\Phi$ must belong to the intersection between the cone $S$ of positive definite spectral
densities and the affine set $W$ of the solutions of the moment problem (5.11), for given
$G$ and $\Sigma$. Since the constraint is as before, so are the spaces $W$ and $V$. In particular,
we have

$$V^\perp = \{\Phi = G^* M G,\ M \in \mathbb{H}_n\}.$$

By Theorem 6.3 we get the form of the optimal spectrum derived in [48]:

$$\Phi_c = \big[ \Psi^{-1} + G^* \Lambda_c G \big]^{-1}, \qquad \Lambda_c \in \mathbb{H}_n,$$

where $\Lambda_c$ is chosen so that (5.11) is satisfied.
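
How the multiplier is chosen "so that (5.11) is satisfied" can be illustrated in the simplest scalar case $n = m = 1$, where $G^* \Lambda_c G$ reduces to $\lambda\, |G|^2$ and the optimal spectrum is $1/(\Psi^{-1} + \lambda |G|^2)$. The sketch below finds $\lambda$ by bisection, using the fact that the left-hand side of (5.11) decreases as $\lambda$ grows. The filter, the prior, the target value $\Sigma = 1$, the bracket $[0, 100]$ and all names are hypothetical; this is not the algorithm of [48].

```python
import math

A_G, B_G = 0.3, 1.0        # hypothetical scalar filter G(z) = B_G / (z - A_G)
A_PSI = 0.5                # hypothetical prior: Psi = 1 / |1 - A_PSI e^{-j theta}|^2

def gain(theta):
    """|G(e^{j theta})|^2 for the scalar filter."""
    return B_G ** 2 / (1.0 + A_G ** 2 - 2.0 * A_G * math.cos(theta))

def psi_inverse(theta):
    """Inverse of the prior spectral density."""
    return 1.0 + A_PSI ** 2 - 2.0 * A_PSI * math.cos(theta)

def state_variance(lam, grid=2048):
    """(1/2pi) * integral of |G|^2 / (Psi^{-1} + lam |G|^2), on a uniform grid."""
    total = 0.0
    for i in range(grid):
        theta = -math.pi + 2.0 * math.pi * i / grid
        total += gain(theta) / (psi_inverse(theta) + lam * gain(theta))
    return total / grid

def solve_multiplier(sigma, lo, hi, iterations=100):
    """Bisection on lam: state_variance is decreasing in lam on the bracket."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if state_variance(mid) > sigma:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

lam_c = solve_multiplier(sigma=1.0, lo=0.0, hi=100.0)
print(lam_c, state_variance(lam_c))
```

In the matrix case $\Lambda_c$ has several free entries and a one-dimensional search no longer suffices, so a different numerical method would be needed.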



6.3. Kullback-Leibler approximation of spectral densities. Consider the
same set up as in Subsection 5.3 in the scalar case ($m = 1$) when an a priori estimate
$\Psi$ of the spectrum is available. The latter is assumed to be essentially bounded and
coercive. In [57], the following constrained approximation problem was studied: Minimize
$F(\Phi) = \mathbb{D}(\Psi \| \Phi) = \int \log(\Psi/\Phi)\, \Psi$ among coercive spectra $\Phi \in L^\infty(\mathbb{T})$ satisfying
(5.11). Notice that minimization occurs with respect to the second argument. This
makes it possible to include the maximum entropy problem in this framework ($\Psi \equiv 1$) and to
obtain a rational solution, rather than one in the exponential class, when $\Psi$ is rational. Further
justification for this choice of the criterion may be found in [57]. In this case, for
$\delta\Phi \in L^\infty$, $F'(\Phi; \delta\Phi) = -\langle \Phi^{-1} \Psi, \delta\Phi \rangle_{L^2}$. Since the constraint is as in (5.11), so is the
space $V^\perp$, see (5.15). By Theorem 3.3 we conclude that the optimal spectrum has
the form obtained in [57]

$$\Phi_c(e^{j\vartheta}) = \frac{\Psi(e^{j\vartheta})}{G^*(e^{j\vartheta})\, \Lambda_c\, G(e^{j\vartheta})}, \qquad \Lambda_c \in \mathbb{H}_n.$$

The difficulties of extending this result to the multivariable case are illustrated in [55,
p. 1062].
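
The directional derivative used in this subsection follows from a one-line computation. As a sketch, take the inner product of $L^2$ to be the plain integral $\langle f, g \rangle = \int f g$ over the circle (this normalization is an assumption made here and only rescales the result) and note that $\Phi + \varepsilon\, \delta\Phi$ remains coercive for small $\varepsilon$:

```latex
\begin{aligned}
F(\Phi + \varepsilon\,\delta\Phi) &= \int \Psi \log \Psi - \int \Psi \log(\Phi + \varepsilon\,\delta\Phi), \\
F'(\Phi; \delta\Phi) &= \frac{d}{d\varepsilon} F(\Phi + \varepsilon\,\delta\Phi)\Big|_{\varepsilon = 0}
  = -\int \frac{\Psi}{\Phi}\, \delta\Phi = -\langle \Phi^{-1}\Psi,\, \delta\Phi \rangle_{L^2}.
\end{aligned}
```

Requiring this to vanish for all $\delta\Phi \in V$ gives $\Phi_c^{-1} \Psi \in V^\perp$, that is, $\Phi_c^{-1} \Psi = G^* \Lambda_c G$ for some Hermitian $\Lambda_c$, which is the ratio form of the optimal spectrum.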

7. Shannon entropy for finite measure spaces. The Shannon entropy underlying
all the criteria considered so far will be addressed here directly via the first
(rather than the second) part of equation (1.8) and with a finite measure $\mu$ replacing
Lebesgue measure. Let $(X, \mathcal{X}, \mu)$ be a finite measure space, let $\varphi_i$, $i = 1, \ldots, d$,
be functions in $\mathcal{H} = L^2(X, \mathcal{X}, \mu)$ and let $\alpha \in \mathbb{R}^d$. Consider the problem of finding a
nonnegative function $p$ in $L^\infty(X, \mathcal{X}, \mu)$ maximizing the Shannon entropy

(7.1)   $$F(p) = H(p) = -\int_X \log[p(x)]\, p(x)\, d\mu$$

under the constraints

(7.2)   $$\int_X p(x)\, d\mu = 1,$$
(7.3)   $$\int_X \varphi_i(x)\, p(x)\, d\mu = \alpha_i, \qquad i = 1, \ldots, d.$$



Lemma 5.1 can be readily adapted to this setting. Let $p_c \in L^\infty(X, \mathcal{X}, \mu)$ be nonnegative
and bounded away from zero $\mu$-a.e. Let $\delta p \in L^\infty(X, \mathcal{X}, \mu)$. Then the directional
derivative of the functional (7.1) in direction $\delta p$ exists at $p_c$ and is given by

$$F'(p_c; \delta p) = -\int_X [1 + \log p_c(x)]\, \delta p(x)\, d\mu = -\langle 1 + \log p_c,\, \delta p \rangle_{\mathcal{H}}.$$

Let us show that the fundamental geometric result Theorem 3.3 provides the form of
the extremal solution also in this case. Suppose there exists $p_0 \in L^\infty(X, \mathcal{X}, \mu)$, positive
$\mu$-a.e., satisfying (7.2)-(7.3). Then $p \in L^\infty(X, \mathcal{X}, \mu)$ also satisfies the
constraints if it belongs to the affine space $p_0 + V$, where $V$ is the subspace of functions
$f \in L^\infty(X, \mathcal{X}, \mu)$ such that

(7.4)   $$\int_X f(x)\, d\mu = 0,$$
(7.5)   $$\int_X \varphi_i(x)\, f(x)\, d\mu = 0, \qquad i = 1, \ldots, d.$$



Observe now that $V^\perp$ is the subspace of functions of the form $\theta_0 + \sum_{i=1}^{d} \theta_i \varphi_i(x)$.
Observe also that for $p_c$ bounded and bounded away from zero as above, $\log p_c$ also
belongs to $L^\infty(X, \mathcal{X}, \mu)$ and, consequently, to $L^2(X, \mathcal{X}, \mu)$. By Theorem 3.3 we conclude
that $-(1 + \log p_c) \in V^\perp$, namely $p_c$ must be of the form

(7.6)   $$p_c(x) = C \exp\Big[ \sum_{i=1}^{d} \theta_i \varphi_i(x) \Big]$$

for some values $C$ and $\theta_i$, $i = 1, \ldots, d$, chosen so that the constraints are satisfied. This is
just the well-known fact that, if the maximizer exists, it belongs to the exponential
family. In the case when $d = 1$ and $\varphi_1 = H$, the Hamiltonian function, we get a baby
version of Gibbs' variational principle, namely that the Gibbs distribution

$$p_G(x) = C \exp\Big[ -\frac{H(x)}{kT} \Big]$$

minimizes the free energy $\langle H, p \rangle - kT\, F(p)$, where $F$ is as in (7.1), $k$ is Boltzmann's
constant and $T$ is the absolute temperature [43].
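
The exponential form (7.6) is easy to compute when $X$ is a finite set. The sketch below takes $X = \{1, \ldots, 6\}$ with counting measure, $d = 1$ and $\varphi_1(x) = x$, that is, a loaded die with prescribed mean. The target mean $4.5$, the search bracket and the function names are hypothetical choices; the multiplier $\theta_1$ is found by bisection, since the mean of $C \exp(\theta x)$ increases with $\theta$.

```python
import math

def maxent_pmf(values, target_mean, lo=-50.0, hi=50.0, iterations=200):
    """Maximum entropy pmf on `values` with prescribed mean: p_i = C exp(theta * v_i)."""

    def pmf(theta):
        shift = max(theta * v for v in values)       # avoid overflow in exp
        weights = [math.exp(theta * v - shift) for v in values]
        total = sum(weights)
        return [w / total for w in weights]

    def mean(theta):
        return sum(v * p for v, p in zip(values, pmf(theta)))

    # The mean is increasing in theta, so bisection finds the multiplier.
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    theta = 0.5 * (lo + hi)
    return theta, pmf(theta)

faces = [1, 2, 3, 4, 5, 6]
theta, p = maxent_pmf(faces, 4.5)
print(theta, p)
```

The normalization constant $C$ is fixed by (7.2) and the multiplier by (7.3); consecutive probabilities have the constant ratio $e^{\theta_1}$, which is the discrete counterpart of (7.6).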

The well-known fact that among all probability densities with given mean and
variance the Gaussian has maximum entropy can also be derived in this framework
by taking as "reference" measure $\mu$ a Gaussian measure.

8. Reciprocal processes identification with prior. In this section, we consider
the problem of block-circulant covariance completion addressed in [23, 24] and
we show that our result allows for a direct solution of this more general problem
also in the case (not considered there) when a prior estimate is available. The above
mentioned block-circulant covariance completion is equivalent to the computation of
the parameters of a stationary reciprocal process of order $n$ defined on the discrete
circle $\mathbb{Z}/N\mathbb{Z}$. A process $y(t)$ defined on $\mathbb{Z}/N\mathbb{Z}$ is reciprocal if it enjoys the following
property. Take any two points $i, j \in \mathbb{Z}/N\mathbb{Z}$: They divide the discrete circle into two
(discrete) arcs. Then the process $y(t)$ is reciprocal of order 1 if $y(t)$ and $y(r)$ are conditionally
independent given $y(i)$ and $y(j)$, for any $i, j$ and for any $t$ and $r$ belonging to
different arcs. The process $y(t)$ is reciprocal of order $n$ if $y(t)$ and $y(r)$ are conditionally
independent given $y(i), y(i+1), \ldots, y(i+n-1)$ and $y(j), y(j+1), \ldots, y(j+n-1)$,
for any $i, j$ and for any $t$ and $r$ belonging to different arcs. Reciprocal processes defined
on (a finite interval of) the integer line can be seen as a special class of discrete
Markov random fields restricted to one dimension. Stationary reciprocal processes
defined on $\mathbb{Z}/N\mathbb{Z}$ are potentially useful for describing signals which naturally live in
a finite region of the time (or space) line such as texture images.

Let $\Sigma_i \in \mathbb{R}^{m \times m}$, $i = 0, 1, \ldots, n$, be given. In [53] the problem has been considered
of computing the parameters of a stationary reciprocal process of order $n$ defined on the
discrete circle $\mathbb{Z}/N\mathbb{Z}$ such that the first $n + 1$ covariance lags of this process match
the given $\Sigma_i$, $i = 0, 1, \ldots, n$. For the importance and applications of this problem we
refer to [23] and references therein. For a discussion of stationary reciprocal processes,
we refer to [75]. In [23] it was shown that this problem is equivalent to computing an
extension $\Sigma_i \in \mathbb{R}^{m \times m}$, $i = n+1, n+2, \ldots, N-1$, in such a way that the symmetric
block-Toeplitz matrix $\Sigma$ whose first block row is $[\Sigma_0 \mid \Sigma_1^T \mid \cdots \mid \Sigma_{N-1}^T]$ maximizes

(8.1)   $$F(\Sigma) := \log[\det[\Sigma]]$$

in the set $W \cap S$, where $S$ is the cone of positive definite matrices and $W$ is the affine
space of block-circulant symmetric matrices such that the north-west corner block of
dimension $m(n+1) \times m(n+1)$ is equal to the symmetric block-Toeplitz matrix $\Sigma_n$
whose first block row is $[\Sigma_0 \mid \Sigma_1^T \mid \cdots \mid \Sigma_n^T]$. The form of the solution to this problem may
be easily computed by using Theorem 4.2. In fact, define








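$$U = \begin{bmatrix} 0 & I_m & 0 & \cdots & 0 \\ 0 & 0 & I_m & \cdots & 0 \\ \vdots & & & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & I_m \\ I_m & 0 & 0 & \cdots & 0 \end{bmatrix} \in \mathbb{R}^{mN \times mN}, \qquad E = \begin{bmatrix} I_m & 0 & \cdots & 0 \\ 0 & I_m & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & I_m \\ 0 & 0 & \cdots & 0 \\ \vdots & & & \vdots \\ 0 & 0 & \cdots & 0 \end{bmatrix} \in \mathbb{R}^{Nm \times (n+1)m},$$

where $I_m$ denotes the $m \times m$ identity matrix. Clearly, $U^T U = U U^T = I_{mN}$, i.e. $U$ is
orthogonal. Note that a matrix $C$ with $N \times N$ blocks is block-circulant if and only if
it commutes with $U$, namely if and only if it satisfies

(8.2)   $$U^T C U = C.$$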
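
The characterization (8.2) can be tested directly in the scalar case $m = 1$, where $U$ is the $N \times N$ cyclic shift. The sketch below builds $U$ with ones on the superdiagonal and in the bottom-left corner (one of the two equivalent shift conventions, assumed here) and checks (8.2) on a circulant matrix and on a Toeplitz matrix that is not circulant. The helper names and the test matrices are hypothetical.

```python
def shift_matrix(N):
    """Cyclic shift U for m = 1: ones on the superdiagonal and in the bottom-left corner."""
    return [[1 if j == (i + 1) % N else 0 for j in range(N)] for i in range(N)]

def transpose(M):
    return [list(row) for row in zip(*M)]

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def circulant(first_row):
    """Circulant matrix whose row i is the first row shifted right by i places."""
    N = len(first_row)
    return [[first_row[(j - i) % N] for j in range(N)] for i in range(N)]

def satisfies_8_2(C):
    """Check U^T C U == C exactly (integer entries)."""
    U = shift_matrix(len(C))
    return matmul(matmul(transpose(U), C), U) == C

C_circ = circulant([4, 1, 0, 1])
C_toep = [[4, 1, 0, 0], [1, 4, 1, 0], [0, 1, 4, 1], [0, 0, 1, 4]]
print(satisfies_8_2(C_circ), satisfies_8_2(C_toep))
```

Conjugation by $U$ moves every entry one step along its diagonal, with wrap-around, so invariance under it forces the wrap-around structure that the banded Toeplitz matrix lacks.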
The affine set $W$ may then be characterized as

(8.3)   $$W = \{\Sigma = \Sigma^T : E^T \Sigma E = \Sigma_n,\ U^T \Sigma U = \Sigma\} = A + V,$$

with $A \in W$ and

(8.4)   $$V := \{\Sigma = \Sigma^T : E^T \Sigma E = 0,\ U^T \Sigma U = \Sigma\}.$$

It is not difficult to check that

(8.5)   $$V^\perp = \{\Delta = E \Lambda E^T + U \Theta U^T - \Theta,\ \Lambda \in \mathbb{R}^{(n+1)m \times (n+1)m},\ \Theta \in \mathbb{R}^{Nm \times Nm}\}.$$

Hence the optimal solution, if it exists, has the form

(8.6)   $$\Sigma_c = (E \Lambda E^T + U \Theta U^T - \Theta)^{-1},$$

where $\Lambda = \Lambda^T \in \mathbb{R}^{(n+1)m \times (n+1)m}$ and $\Theta = \Theta^T \in \mathbb{R}^{Nm \times Nm}$ must be chosen in such
a way that the constraints are satisfied. This can be done through convex duality as
discussed in [23]. The dual problem consists here in the unconstrained maximization
of the concave function

$$L(\Lambda, \Theta) = \mathrm{tr} \log (E \Lambda E^T + U \Theta U^T - \Theta) + \mathrm{tr}\, I - \mathrm{tr}\, (\Lambda \Sigma_n)$$

over a suitable set of multiplier pairs $(\Lambda, \Theta)$. Once the optimal parameters $\Lambda$ and $\Theta$
have been found, the optimal solution (8.6) has inverse $\Sigma_c^{-1}$, which is a block-circulant
matrix whose first block-row has the form

(8.7)   $$[M_0 \mid M_1 \mid \cdots \mid M_n \mid 0 \mid \cdots \mid 0 \mid M_n^T \mid M_{n-1}^T \mid \cdots \mid M_1^T],$$

where the matrices $M_i$ are the sought-for parameters of the stationary reciprocal
process.
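
The zero pattern in (8.7) can be observed on the smallest nontrivial example: $m = 1$, $n = 1$, $N = 4$. The lags $\sigma_0, \sigma_1$ are given, the symmetric circulant matrix has first row $[\sigma_0, \sigma_1, \sigma_2, \sigma_1]$, and only $\sigma_2$ is free. The sketch below maximizes $\log\det$ over $\sigma_2$ by a ternary search, bypassing the dual problem, and then inspects the first row of the inverse. It relies on the standard fact, not taken from the text, that the eigenvalues of a symmetric circulant matrix are the cosine transform of its first row; the values $\sigma_0 = 2$, $\sigma_1 = 1$ and the function names are hypothetical.

```python
import math

def eigenvalues(first_row):
    """Eigenvalues of the symmetric circulant matrix with the given first row."""
    N = len(first_row)
    return [sum(first_row[j] * math.cos(2.0 * math.pi * k * j / N) for j in range(N))
            for k in range(N)]

def log_det(first_row):
    return sum(math.log(lam) for lam in eigenvalues(first_row))

def inverse_first_row(first_row):
    """First row of the inverse, which is again a symmetric circulant matrix."""
    N = len(first_row)
    lams = eigenvalues(first_row)
    return [sum(math.cos(2.0 * math.pi * k * j / N) / lams[k] for k in range(N)) / N
            for j in range(N)]

def complete_lag2(s0, s1, iterations=200):
    """Maximize log det over the free lag s2 of the first row [s0, s1, s2, s1]."""
    lo, hi = 2.0 * abs(s1) - s0, s0      # positive definiteness holds strictly inside
    for _ in range(iterations):
        a = lo + (hi - lo) / 3.0
        b = hi - (hi - lo) / 3.0
        if log_det([s0, s1, a, s1]) < log_det([s0, s1, b, s1]):
            lo = a
        else:
            hi = b
    return 0.5 * (lo + hi)

s2 = complete_lag2(2.0, 1.0)
row_inv = inverse_first_row([2.0, 1.0, s2, 1.0])
print(s2, row_inv)
```

The entry of `row_inv` at lag 2 vanishes up to the accuracy of the search, so the first row of the inverse has the pattern $[M_0, M_1, 0, M_1]$ of (8.7).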

We now address the case when prior information is available in terms of the
parameters of a reciprocal process (possibly of higher order) or, equivalently, of a
prior positive definite covariance matrix $\Sigma_P \in \mathbb{R}^{Nm \times Nm}$. In this case, instead of
maximizing (8.1) we minimize the divergence (see (6.4))

(8.8)   $$F(\Sigma) := \big[ \log \det (\Sigma^{-1} \Sigma_P) + \mathrm{tr}\, (\Sigma_P^{-1} \Sigma) \big]$$

under the same constraints. By employing Theorem 6.2 we get that the
optimal solution has the form

(8.9)   $$\Sigma_c = (E \Lambda E^T + U \Theta U^T - \Theta + \Sigma_P^{-1})^{-1},$$

where, again, $\Lambda = \Lambda^T \in \mathbb{R}^{(n+1)m \times (n+1)m}$ and $\Theta = \Theta^T \in \mathbb{R}^{Nm \times Nm}$ must be chosen
in such a way that the constraints are satisfied. As before, this can be done by solving
a dual problem for which existence can be proven along the lines of [33]. From (8.9) it
follows that when $\Sigma_P$ is also the covariance matrix of a stationary reciprocal process of
order $n$ or less, the optimal solution is also reciprocal of order $n$ and coincides with the
optimal solution of the problem without prior! This remarkable result follows from
(8.9) and the fact that there exists a unique block-circulant covariance completion
satisfying the linear constraints whose inverse has block zeros in the first block-row as in (8.7). If
instead $\Sigma_P$ is the covariance matrix of a stationary reciprocal process of order $n_1 > n$
(requiring a larger memory), then the optimal solution is the covariance of a reciprocal
process of order $n_1$ whose parameters may be read in the first block-row of $\Sigma_c^{-1}$.

9. Extension to functionals defined on a Banach space. In some applications,
Theorem 3.3 does not suffice. For this reason, we mention the straightforward
extension of our main result to functionals $F$ defined on a Banach space. Let $X$ be
a Banach space and let $F : X \to \mathbb{R}$ be a functional. We say that $F$ is Gâteaux-differentiable
at $x_0$ in direction $v$ if the limit

$$F'(x_0; v) := \lim_{\varepsilon \to 0} \frac{F(x_0 + \varepsilon v) - F(x_0)}{\varepsilon}$$

exists. In this case, $F'(x_0; v)$ is called the directional derivative of $F$ at $x_0$ in direction
$v$. We say that $F$ is Fréchet-differentiable at $x_0$ if there exists a bounded linear
functional $DF_{x_0}$ on $X$ such that

$$\lim_{\|h\|_X \to 0} \frac{|F(x_0 + h) - F(x_0) - DF_{x_0}(h)|}{\|h\|_X} = 0.$$

The functional $DF_{x_0}$ is called the Fréchet differential of $F$ at $x_0$. Again, if $F$ is
Fréchet-differentiable at $x_0$, then $DF_{x_0}$ is unique and, for any $v \in X$, $F$ is Gâteaux-differentiable
at $x_0$ in direction $v$ and it holds

(9.1)   $$F'(x_0; v) = DF_{x_0}(v).$$

Theorem 9.1. Let $X$ be a Banach space, let $V \subset X$ be a subspace, let $x_0 \in X$
and consider the corresponding coset $W := x_0 + V$. Assume that the functional $F$ is
Fréchet-differentiable at $w_c \in W$. Then $w_c$ is a critical point of $F$ over $W$ if and only
if $DF_{w_c}$ belongs to the annihilator of $V$.

Proof. Observe that $F'(w_c; v) = 0$ for all $v \in V$ if and only if $DF_{w_c}(v) = 0$, $\forall v \in V$. □

When $F$ is not Fréchet-differentiable at $w_c$ but merely Gâteaux-differentiable
in directions varying in a subspace, a generalization such as in Theorem 3.3 can be
established.

10. Closing comments. In this paper, we have established a simple orthogonality
condition that allows one to derive the form of the optimal solution in a plethora of
maximum entropy problems. We feel that this geometric condition affords a considerable
conceptual simplification, allowing one to cast least-squares and maximum entropy
problems in the same framework (admittedly, not as deep as the one provided in
[3"T]). It can, moreover, be readily generalized to abstract situations and to problems
with nonlinear constraints. Further study is needed to see whether this approach may
be suitably adapted to the abstract setting of Subsection 1.2. A suitable mixture
of the geometry we have seen in Burg's and in Dempster's problems in Subsections
5.2 and 4.2 might provide the key to understanding AR and ARMA Identification of
Graphical Models, a topic which has recently received considerable attention, see e.g.
[721 IM1 1521 1551 [3]. Finally, we should never forget the motto over the entrance to
Plato's Academy: "Ἀγεωμέτρητος μηδεὶς εἰσίτω", namely "Let no one untrained in
geometry enter."